Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems

نویسندگان

  • Syed Sarfaraz Akhtar
  • Arihant Gupta
  • Avijit Vajpayee
  • Arjit Srivastava
  • Manish Shrivastava
چکیده

With the advent of word representations, word similarity tasks are becoming increasing popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets of six Indian languages – Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These languages are most spoken Indian languages worldwide after Hindi and Bengali. For the construction of these datasets, our approach relies on translation and re-annotation of word similarity datasets of English. We also present baseline scores for word representation models using state-of-the-art techniques for Urdu, Telugu and Marathi by evaluating them on newly created word similarity datasets.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The IIT Bombay SMT System for ICON 2014 Tools Contest

In this paper, we describe our submission to the ICON 2014 Tools Contest for Machine Translation. The source languages are English, Marathi, Tamil, Telugu, Bengali and the target language is Hindi. We submitted 15 systems; 5 each for the tourism, health and general domains. Our submission is a Phrase-based Statistical Machine Translation system with preprocessing and post-processing elements. A...

متن کامل

A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets

Despite being one of the most popular tasks in lexical semantics, word similarity has often been limited to the English language. Other languages, even those that are widely spoken such as Spanish, do not have a reliable word similarity evaluation framework. We put forward robust methodologies for the extension of existing English datasets to other languages, both at monolingual and cross-lingu...

متن کامل

SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity

This paper introduces a new task on Multilingual and Cross-lingual Semantic Word Similarity which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. High quality datasets were manually curated for the five languages with high inter-annotator agreements (consistently in the 0.9 ballpark). These were used for semi-automati...

متن کامل

Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure

We suggest a new method for creating and using gold-standard datasets for word similarity evaluation. Our goal is to improve the reliability of the evaluation, and we do this by redesigning the annotation task to achieve higher inter-rater agreement, and by defining a performance measure which takes the reliability of each annotation decision in the dataset into account.

متن کامل

Cost Effective Dependency Parsing for Indian Languages

Indian languages are MoR-FWO1 and hence differ from English in structure and morphology. There are many distinguished characteristics possessed by Indian languages. While working with these languages we have to keep in mind, these characteristics and plan strategies accordingly. We worked on improving Dependency Parsing for Indian Languages, more specifically for Hindi, an Indo-Aryan Language. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017